Section Requirement: Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use an 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then a linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe.
The following is the code for pre-processing, the helper functions used, logistic regression, and support vector machines.
import pandas as pd
import numpy as np
##Load original dataset. No longer needed.
##The cleaning process is very computationally expensive, therefore the 'rainfall.csv' file was created for later use.
#rainfall_original = pd.read_csv('weatherAus.csv')
##load pre-generated rainfall.csv file.
rainfall = pd.read_csv('rainfall.csv', index_col=0)
rainfall.info()
#Functions to find the average value using the bracketing values around the NaN's.
#For instance if a city's 'MinTemp' has 34, 32, NaN, NaN, 55 recorded
#the function will average 32 and 55 for the first NaN: (32+55)/2 = 43.5
#and average the above value and 55 for the second NaN: (43.5+55)/2 = 49.25
#Will only use values if they are from the same city.
#If NaN is the earliest timepoint for a given city the next timepoint with no NaN will be given instead of the mean.
#If NaN is the latest timepoint for a given city the previous timepoint with no NaN will be given instead of the mean.
def impute_by_city(cities, variables):
    for c in cities:
        #Parse out observations from a single city.
        temp = rainfall[rainfall.Location == c]
        #Iterate through all observations of the temp data frame.
        i = min(temp.index)
        while i <= max(temp.index):
            for v in variables:
                #Check to see if there are values recorded for the variable; pass over if all are NaN.
                if pd.isna(temp[v]).all():
                    pass
                #Check to see if a single value is NaN.
                elif pd.isna(temp.loc[i, v]):
                    #Find the mean of bracketing values and impute into the main dataframe.
                    #(.loc is used to avoid chained-assignment issues when writing values back.)
                    temp.loc[i, v] = find_mean(temp[v], i)
                    rainfall.loc[i, v] = temp.loc[i, v]
            i = i + 1
#Find mean of bracketing values.
def find_mean(templist, index):
    #If NaN is the earliest timepoint for the city, take the next value that is not NaN.
    if index == min(templist.index):
        return find_top(templist, index)
    #If the latest timepoint for the city, take the previous value that is not NaN.
    elif index == max(templist.index):
        return find_bottom(templist, index)
    else:
        #Find the previous non-NaN value.
        bottom = find_bottom(templist, index)
        #Find the next non-NaN value.
        top = find_top(templist, index)
        #If the current value is not from the latest timepoint for the city but there are no more
        #non-NaN values recorded after it, then the previous non-NaN value is taken.
        if pd.isna(top):
            return bottom
        else:
            mean = (top + bottom)/2
            return mean
#Find previous non-NaN value.
def find_bottom(templist, index):
    while pd.isna(templist[index-1]):
        index = index-1
    bottom = templist[index-1]
    return bottom
#Find next non-NaN value.
#If there are no more non-NaN values return the previous non-NaN value.
def find_top(templist, index):
    while pd.isna(templist[index+1]):
        index = index+1
        if index == max(templist.index):
            top = np.nan
            return top
    top = templist[index+1]
    return top
##Code for first-run data cleaning, no longer needed after 'rainfall.csv' was created at the end of the cleaning process.
#rainfall = rainfall_original.copy()
##'RISK_MM' was used by creator of dataset to extrapolate response variable, 'RainTomorrow.' Needs to be dropped to not
##influence prediction.
#rainfall.drop(["RISK_MM"], axis=1, inplace=True)
##Drop any observation with no record of rainfall for the day. Cannot be imputed.
#rainfall.dropna(subset=["RainToday"], inplace=True)
#Reset the Index of each observation to match its iloc, getting rid of gaps between Index integers.
#rainfall = rainfall.reset_index(drop=True)
#rainfall.info()
##can be skipped if rainfall.csv already generated!
##set the cardinal directions to degrees.
#directions = {'N':0, 'NNE':22.5, 'NE':45, 'ENE':67.5, 'E':90, 'ESE':112.5, 'SE':135, 'SSE':157.5, 'S':180,\
#              'SSW':202.5, 'SW':225, 'WSW':247.5, 'W':270, 'WNW':292.5, 'NW':315, 'NNW':337.5}
##Replace cardinal directions with their corresponding degrees.
#rainfall = rainfall.replace(directions)
#Get name of all cities in the data frame.
#cities = rainfall.Location.unique()
#c_variables = []
#d_variables = []
##change 'Yes' and 'No' to 1 and 0 respectively.
#rainfall.RainToday = (rainfall.RainToday == 'Yes').astype(int)
#rainfall.RainTomorrow = (rainfall.RainTomorrow == 'Yes').astype(int)
##Find all variables with continuous data type.
#for l in list(rainfall):
# if (rainfall[l].dtypes == 'float64'):
# c_variables.append(l)
# else:
# d_variables.append(l)
##can be skipped if rainfall.csv already generated! Very expensive; 'rainfall.csv' can be loaded from the working directory.
##Impute values to NaN's and save to csv file for later use.
#impute_by_city(cities, c_variables)
#rainfall.to_csv("rainfall.csv", sep=',', index=True)
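A side note on the degree mapping above (not something the original analysis did): N (0°) and NNW (337.5°) are adjacent compass directions but end up numerically far apart as raw degrees. A common alternative is to encode each angle as sine/cosine components, sketched here on a few illustrative values:

```python
import numpy as np
import pandas as pd

# Hypothetical wind directions already mapped to degrees: N, NNW, S.
degrees = pd.Series([0.0, 337.5, 180.0])

# Encode each angle as a point on the unit circle so adjacent
# compass directions stay numerically close.
sin_comp = np.sin(np.deg2rad(degrees))
cos_comp = np.cos(np.deg2rad(degrees))

# N and NNW are now close in (sin, cos) space, whereas their raw
# degree values differ by 337.5.
dist = np.hypot(sin_comp[0] - sin_comp[1], cos_comp[0] - cos_comp[1])
```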
#Variables 'Evaporation' and 'Sunshine' contained many missing values, too many to be imputed.
rainfall = rainfall.drop(['Evaporation', 'Sunshine'], axis = 1)
#Get name of all cities in the data frame.
l = list(rainfall.Location.unique())
#Drop all observations with NaN's. These are values that could not be imputed using the above code.
rainfall.dropna(subset = list(rainfall), inplace = True)
#List all cities that were dropped
for i in l:
    if i not in rainfall.Location.unique():
        print(i)
#'Date' and 'Location' variables not needed for prediction.
rainfall = rainfall.drop(['Date', 'Location'], axis = 1)
##Make a copy of the main data frame to be used by Logistic Regression.
lr_rainfall = rainfall.copy()
##Make a copy of the main data frame to be used by Support Vector Machines.
svm_rainfall = rainfall.copy()
from sklearn.model_selection import ShuffleSplit as ss
#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in lr_rainfall:
    #Response variable is 'RainTomorrow'.
    y = lr_rainfall['RainTomorrow'].values
    #Remove response variable from dataframe.
    del lr_rainfall['RainTomorrow']
#Everything else is the explanatory variables used in prediction.
x = lr_rainfall.values
#Split our data into training and testing sets, 80% of data will be in the training set and 20% the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement. It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size = 0.2)
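As a quick illustration of what ShuffleSplit yields (a standalone toy example, separate from the pipeline above), each call to `.split` produces disjoint train/test index arrays in an 80/20 proportion:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

toy_x = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
toy_y = np.array([0, 1] * 5)

splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
sizes = []
for train_idx, test_idx in splitter.split(toy_x, toy_y):
    # Train and test indices are disjoint within each shuffle.
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    sizes.append((len(train_idx), len(test_idx)))
# Each of the 3 splits holds 8 training samples and 2 testing samples.
```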
from sklearn import metrics as mt
from sklearn.preprocessing import StandardScaler as sts
from sklearn.linear_model import LogisticRegression as lr
import time
column_names = lr_rainfall.columns
weights = []
weights_array = []
scl_obj = sts()
t0=time.time()
#Split the data into training and testing set 5 different ways, iterate through each way.
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(x,y)):
    #Standardize the explanatory variables: fit the scaler on the training set, then transform both sets so each
    #variable has its training-set mean subtracted and is divided by its training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])
    #Fit Logistic Regression on the training set.
    lr_clf = lr(penalty='l2', C=0.05)
    lr_clf.fit(X_train_scaled, y[train_indices])
    #Predict using the scaled explanatory variables of the testing set.
    y_hat = lr_clf.predict(X_test_scaled)
    #Find the accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices], y_hat)
    conf = mt.confusion_matrix(y[test_indices], y_hat)
    print("")
    print('accuracy:', acc)
    print(conf)
    print("Time to Run:", time.time()-t0)
    #Get the name of each variable and the weight computed by the regression model.
    zip_vars = pd.Series(lr_clf.coef_[0].T, index=column_names)
    for name, coef in zip_vars.items():
        #Print out the name and weight of each variable.
        print(name, 'has weight of', coef)
        #Add the weight computed in the current training/testing split.
        weights.append(coef)
    #Add all the weights of this iteration into one master array.
    weights_array.append(weights)
    #Reset the weights variable for the next iteration.
    weights = []
#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)
import plotly as ply
#Run at the start of plotly.
ply.offline.init_notebook_mode()
#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)
#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)
#Make one array with variable names as the Index, and a column of means and a column of standard deviation for each variable.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)
#Sort the variable in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])
#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)
#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}
ply.offline.iplot(fig)
#Grab coefficients with absolute values greater than 0.5; this cutoff can be set by the user.
cutoff = 0.5
lr_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        lr_voi.append(index)
from matplotlib import pyplot as plt
#Add response variable back to the explanatory dataset.
lr_rainfall['RainTomorrow'] = y
#Group observations by the response variable, 1's together and 0's together.
df_grouped = lr_rainfall.groupby(['RainTomorrow'])
#Plot kernel density estimation of variables with weights higher than the user-defined value set above, 0.5 in this case.
vars_to_plot = lr_voi
for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    plt.subplot(1,1,1)
    ax = df_grouped[v].plot.kde()
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')
#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in svm_rainfall:
    #Response variable is 'RainTomorrow'.
    y = svm_rainfall['RainTomorrow'].values
    #Remove response variable from dataframe.
    del svm_rainfall['RainTomorrow']
#Everything else is the explanatory variables used in prediction.
x = svm_rainfall.values
#Split our data into training and testing sets, 80% of data will be in the training set and 20% the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement. It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size = 0.2)
from sklearn.svm import SVC
weights = []
weights_array = []
t0=time.time()
#Split the data into training and testing set 5 different ways, iterate through each way.
for train_indices, test_indices in cv_object.split(x,y):
    #Standardize the explanatory variables: fit the scaler on the training set, then transform both sets so each
    #variable has its training-set mean subtracted and is divided by its training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])
    #Fit a Support Vector Machine on the training set.
    svm_clf = SVC(C=0.5, kernel='linear', degree=3, gamma='auto')
    svm_clf.fit(X_train_scaled, y[train_indices])
    #Predict using the scaled explanatory variables of the testing set.
    y_hat = svm_clf.predict(X_test_scaled)
    #Find the accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices], y_hat)
    conf = mt.confusion_matrix(y[test_indices], y_hat)
    print("")
    print('accuracy:', acc)
    print(conf)
    print("Time to Run:", time.time()-t0)
    #Get the name of each variable and the weight computed by the model.
    zip_vars = pd.Series(svm_clf.coef_[0], index=column_names)
    for name, coef in zip_vars.items():
        #Print out the name and weight of each variable.
        print(name, 'has weight of', coef)
        #Add the weight computed in the current training/testing split.
        weights.append(coef)
    #Add all the weights of this iteration into one master array.
    weights_array.append(weights)
    #Reset the weights variable for the next iteration.
    weights = []
#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)
#look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )
#Run at the start of plotly.
ply.offline.init_notebook_mode()
#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)
#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)
#Make one array with variable names as the Index, and a column of means and a column of standard deviation for each variable.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)
#Sort the variable in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])
#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)
#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}
fig_svm = dict()
fig_svm['data'] = [graph1]
fig_svm['layout'] = {'title': 'Support Vector Machines Weights, with error bars'}
ply.offline.iplot(fig_svm)
#Grab coefficients with absolute values greater than 0.5; this cutoff can be set by the user.
cutoff = 0.5
svm_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        svm_voi.append(index)
#Make a dataframe of the training data from the last iteration above.
df_tested_on = svm_rainfall.iloc[train_indices]
#Get the support vectors from the trained model.
df_support = df_tested_on.iloc[svm_clf.support_,:]
#Add response variable back to the explanatory datasets.
df_support['RainTomorrow'] = y[svm_clf.support_]
svm_rainfall['RainTomorrow'] = y
#Group observations by the response variable, 1's together and 0's together.
df_grouped_support = df_support.groupby(['RainTomorrow'])
df_grouped = svm_rainfall.groupby(['RainTomorrow'])
#Plot kernel density estimation of variables with weights higher than the user-defined value set above, 0.5 in this case.
vars_to_plot = svm_voi
for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    #Plot support vector stats.
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde()
    plt.legend(['no rain','rained'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    #Plot original distributions.
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde()
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')
Section Requirements: Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.
Both Logistic Regression (LR) and Support Vector Machines (SVM) are algorithms used for classification problems. LR and an SVM with a linear kernel generally perform comparably in practice, although the two differ in how they fit the decision boundary and in how expensive they are to train.
For the purposes of the "Predict Rain Tomorrow in Australia" dataset, the accuracy for both LR and SVM hovered around 85%, with no superior performance detected from one to the other. This is likely because our pre-processing removed outliers from the dataset. We ran each algorithm 5 times to see if there would be any significant deviation; both cases proved to have essentially the same accuracy. What was noticeable was the amount of time required to run LR versus SVM. We included a timer to measure the runtime; here are the results:
Logistic Regression (seconds)
Support Vector Machine (seconds)
This was a large dataset, so the SVM took 464 times longer than LR to complete all 5 runs. Given that both performed nearly identically in accuracy, we conclude that LR would be more appropriate for this dataset in terms of the time and resources required.
Section Requirements: Use the weights from logistic regression to interpret the importance of different features for each classification task. Explain your interpretation in detail. Why do you think some variables are more important?
Weights depict the association between each of the predictors and what we are trying to predict: whether it will rain in Australia tomorrow. In LR and SVM it is tempting to conclude that variables with larger weights/coefficients are more important because they point to more significant changes in the chance of rain. Before reaching that conclusion, however, it is important to check whether all the predictors are in the same units, which was not the case for this dataset; comparing raw coefficients would be comparing apples to oranges. So prior to moving forward, it was important to standardize all the predictors. Here are the graphical results for LR and SVM:
ply.offline.iplot(fig)
ply.offline.iplot(fig_svm)
Judging by the two charts, both LR and SVM had identical predictors for their top 4:
Features with weights close to 0 are not strong predictors of rain, while those closer to an absolute value of 1 are considered strong predictors. This actually makes real-world sense: when watching the weather news, meteorologists often point to these predictors when reporting on weather systems leading up to rain in the forecast. It is nice to see that both the LR and SVM models produced similar results. The remaining predictors in between had different weights and thus ranked in a different order.
In general, it seems intuitive to conclude that variables with larger coefficients (weights) are more important because they point to more significant changes in rain. However, because the features are all in different units, one needs to standardize them before comparing coefficients. In both the LR and SVM cases the features were standardized, so each weight represents the change in the response given a one-standard-deviation change in the predictor. The top 4 features listed above have the highest absolute weights, making those variables more important than the rest.
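The units point above can be seen with a tiny synthetic example (illustrative only, not our rainfall data): expressing the same feature in different units changes its raw logistic regression coefficient dramatically, while the coefficient on standardized inputs is unaffected.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x1 = rng.normal(size=(500, 1))
y = (x1[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# The same feature in two different "units" (e.g., metres vs millimetres).
X_m = x1
X_mm = x1 * 1000.0

coef_m = LogisticRegression(max_iter=1000).fit(X_m, y).coef_[0, 0]
coef_mm = LogisticRegression(max_iter=1000).fit(X_mm, y).coef_[0, 0]
# The raw coefficient shrinks by roughly the unit-conversion factor...

# ...but after standardizing, both versions give the same coefficient.
coef_std_m = LogisticRegression(max_iter=1000).fit(
    StandardScaler().fit_transform(X_m), y).coef_[0, 0]
coef_std_mm = LogisticRegression(max_iter=1000).fit(
    StandardScaler().fit_transform(X_mm), y).coef_[0, 0]
```

This is why the bar charts above are built from weights fit on StandardScaler-transformed inputs: the magnitudes become comparable across features.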
Section Requirements: Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data for the SVC model - then analyse the support vectors from the subsampled dataset.
When completing the Support Vector Machine analysis, we did not utilize stochastic gradient descent. Although the SVM took longer to train than logistic regression, it did not become unmanageable. With a dataset larger than this one, SGD would have been applied. We did, however, subset our data in the process in order to create a random sample for the training and testing sets.
The chosen support vectors for some of the variables, Pressure9am, Pressure3pm, and Humidity3pm, are shown visually below. The pressure readings (9am and 3pm) are very similar to each other in their predictive behavior. However, when looking at the data, we can see that high pressure in the morning followed by low pressure in the afternoon significantly contributes to a prediction of rainfall the following day. High humidity at 3pm also contributes to the likelihood of rain tomorrow.
#Make a dataframe of the training data from the last iteration above.
df_tested_on = svm_rainfall.iloc[train_indices]
#Get the support vectors from the trained model.
df_support = df_tested_on.iloc[svm_clf.support_,:]
#Add response variable back to the explanatory datasets.
df_support['RainTomorrow'] = y[svm_clf.support_]
svm_rainfall['RainTomorrow'] = y
#Group observations by the response variable, 1's together and 0's together.
df_grouped_support = df_support.groupby(['RainTomorrow'])
df_grouped = svm_rainfall.groupby(['RainTomorrow'])
#Plot kernel density estimation of variables with weights higher than the user-defined value set above, 0.5 in this case.
vars_to_plot = svm_voi
for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    #Plot support vector stats.
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde()
    plt.legend(['no rain','rained'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    #Plot original distributions.
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde()
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')